Knowledge Discovery in Documents by Extracting Frequent Word Sequences
Author: Helena Ahonen
Abstract
As one approach to address the new information needs caused by the increasing amount of available digital data, the notion of knowledge discovery has been developed. Knowledge discovery methods typically attempt to reveal general patterns and regularities in data instead of specific facts, the kind of information that is hardly possible for any human being to find. In this article, a method for extracting maximal frequent sequences in a set of documents is presented. A maximal frequent sequence is a sequence of words that is frequent in the document collection and, moreover, that is not contained in any other longer frequent sequence. A sequence is considered to be frequent if it appears in at least n documents, where n is the given frequency threshold. Frequent maximal sequences can be used, for instance, as content descriptors for documents: a document is represented as a set of sequences, which can then be used to discover other regularities in the document collection. As the sequences are frequent, their combination of words is not accidental. Moreover, a sequence has exactly the same form in many documents, providing a possibility to do similarity mappings for information retrieval, hypertext linking, clustering, and discovery of frequent co-occurrences. A set of sequences, particularly the longer ones, may also give a concise summary of the topic of the document.

Helena Ahonen, Wilhelm-Schickard-Institut für Informatik, University of Tübingen, Sand 13, D-72076 Tübingen, Germany
LIBRARY TRENDS, Vol. 48, No. 1, Summer 1999, pp. 160-181
© 1999 The Board of Trustees, University of Illinois

INTRODUCTION

The research field of knowledge discovery in databases (or data mining) has in the last years produced methods for finding patterns and regularities in structured data, mainly in databases. The studies have included efforts to utilize the existing data about, for example, clients, products, and competition.
For instance, patterns in client behavior have been extracted. Unstructured data, particularly free running text, place new demands on knowledge discovery methodology.

The representations of knowledge discovered are typically sets of frequently co-occurring items, or clusters of items, that seem to behave similarly in some sense. When the data are structured, it is usually easy to define the items, that is, the parts of the data whose occurrence or behavior is interesting. Regarding unstructured data, however, this is not at all obvious. When documents are concerned, the words of these documents may appear to be natural item candidates.

The more established fields of information retrieval and natural language processing have traditionally concentrated on words and phrases. The phrases may be linguistic phrases, usually noun phrases, or statistical phrases, which are most often frequent noun-noun or adjective-noun pairs. In data mining research, the solution has been to use keywords from a controlled vocabulary (Feldman & Dagan, 1995; Feldman, Dagan, & Klosgen, 1996), names (of people, companies, etc.), or rigid technical terms that usually are noun phrases. However, verb phrases may also carry important hints on acts and processes, as in the following sequences of words:

bank england provided money market assistance
board declared stock split payable april
boost domestic demand

In this article, a method for extracting frequent word sequences from documents is presented. These sequences can be used as items for further knowledge discovery, but they already represent nontrivial and useful knowledge about documents and the entire document collection. In particular, the method is able to find the maximal frequent word sequences, which are sequences of words that are frequent in the document collection and, moreover, that are not contained in any other longer frequent sequence.
A sequence is considered to be frequent if it appears in at least n documents, where n is the given frequency threshold. Textual data place specific demands on the method: frequent sequences can be very long, and the frequency threshold has to be set rather low to find any interesting sequences.

Frequent maximal sequences can be used, for instance, as content descriptors for documents: a document is represented as a set of sequences, which can then be used to discover other regularities in the document collection. As the sequences are frequent, their combination of words is not accidental, and a sequence has exactly the same form in many documents, giving a possibility to do similarity mappings for information retrieval, hypertext linking, clustering, and discovery of frequent co-occurrences. A set of sequences, particularly the longer ones, may also give a concise summary of the topic of the document.

In the next section of this discussion, the entire word sequence discovery process is described, including preprocessing the documents, discovery of maximal sequences, assessing the quality of the discovered word sequences and, finally, the usage possibilities of the sequences. The last section presents experiments conducted using a newswire collection.

EXTRACTING FREQUENT WORD SEQUENCES

The core of knowledge discovery is the algorithms that extract regular patterns in the data. In a practical application domain, however, it is essential to see knowledge discovery not just as a set of fast algorithms but as a process that starts from the data as it is available and ends up in the use of the discovered knowledge for some purpose.
In the context of this article, the knowledge discovery process contains the following phases:

Preprocessing of the documents
Discovery of word sequences
Ordering the word sequences
Use of the discovered knowledge

The starting point of the process is ordinary running text. The discovery method to be presented in this section has been developed with such a document collection in mind; that is, the documents are rather brief. Hence, if longer documents are to be used, it might be necessary to use some kind of fragmentation, e.g., consider each paragraph as a document or use some other advanced method for dividing text into fragments (Hearst, 1995; Heinonen, 1998).

BASIC DEFINITIONS

In order to formulate the approach presented in this article, some terms have to be defined. Assume that there is a document collection that contains a set of documents. Each document can be seen as a sequence of words and word-like characters. Each word has a unique index that identifies the location of the word both in the document and in the document collection. For instance, in the following, portions of three documents can be seen:

(The,70) (Congress,71) (subcommittee,72) (backed,73) (away,74) (from,75) (mandating,76) (specific,77) (retaliation,78) (against,79) (foreign,80) (countries,81) (for,82) (unfair,83) (foreign,84) (trade,85) (practices,86)

(He,105) (urged,106) (Congress,107) (to,108) (reject,109) (provisions,110) (that,111) (would,112) (mandate,113) (U.S.,114) (retaliation,115) (against,116) (foreign,117) (unfair,118) (trade,119) (practices,120)

(Washington,407) (charged,408) (France,409) (West,410) (Germany,411) (the,412) (U.K.,413) (Spain,414) (and,415) (the,416) (EC,417) (Commission,418) (with,419) (unfair,420) (practices,421) (on,422) (behalf,423) (of,424) (Airbus,425)

Actually, the documents are seldom processed as such but rather preprocessed before knowledge is extracted.
The possibilities and benefits of preprocessing are discussed in detail later in this article.

Usually knowledge discovery methods express the extracted knowledge as some kind of regular pattern appearing in the data. In the current approach, these patterns are represented as word sequences. Like the text itself, a word sequence consists of a sequence of words. A word sequence is said to occur in a document if all the words contained in the word sequence can be found in the document in the same order as within the word sequence. For instance, in the previous sample documents, the word sequence (retaliation, against, foreign, unfair, trade, practices) occurs in the first two documents in the locations (78, 79, 80, 83, 85, 86) and (115, 116, 117, 118, 119, 120). The word sequence (unfair, practices) occurs in all the documents, namely in locations (83, 86), (118, 120), and (420, 421).

Naturally, a very large number of word sequences can be found in any document, particularly if sequences of all lengths are considered. The set of all sequences, furthermore, does not contain any knowledge that would not be already contained in the text of the documents. On the contrary, any knowledge would be even more difficult to find. Hence, it is important to consider only word sequences that are frequent in the document collection. A word sequence is said to be frequent if the number of documents in which it occurs is at least the given frequency threshold. For instance, assuming that the frequency threshold is 10, a word sequence is frequent if it occurs in ten or more documents. Note that only one occurrence of a sequence in a document is counted: several occurrences within one document do not make the sequence more frequent. Different definitions in this respect are, of course, possible.

To restrict the number of word sequences further, a maximal gap between words in a sequence is given.
That is, the original locations of any two consecutive words of the sequence can have at most n words between them if n is the maximal gap. Without this restriction, a large number of short frequent sequences, the words of which are located very far away from each other in the text, would be found, e.g., in the previous example, the sequences (Congress, foreign), (Congress, practices), and (against, practices).

If a word sequence occurs in a document, it is also a subsequence of the sequence of words that constitutes the document. In a similar way, a sequence s is a subsequence of any sequence s' if all the words of s occur in s' in the same order as in s. If some word sequence is frequent, all of its subsequences are frequent. Hence, if there exists a frequent word sequence of length 10, 1,012 frequent word sequences of length 2-9 are returned (of the 2^10 = 1,024 ways to select words from ten positions, the empty selection, the ten single words, and the full sequence are excluded: 1,024 - 1 - 10 - 1 = 1,012). Similarly, if (dow, jones, industrial, average) is a frequent word sequence in the collection, all the following sequences are found:

dow jones industrial average
dow jones
dow industrial
dow average
jones industrial
jones average
industrial average
dow jones industrial
dow jones average
jones industrial average

Often, however, the longest possible sequences are the most interesting, and their subsequences do not give more information. Hence, a sequence is returned only if it is a maximal frequent sequence. A word sequence s is a maximal frequent sequence in the document collection if there does not exist any other sequence s' in the collection such that s is a subsequence of s' and s' is frequent in the collection. That is, a frequent word sequence is maximal if it is not contained in any other frequent word sequence.

Clearly, when a subsequence does not have any independent meaning, it is rather useless.
However, sometimes a subsequence is much more frequent than the respective maximal frequent sequence, which indicates that the subsequence also appears in other contexts and, therefore, has a meaning beyond the maximal sequence context. For instance, the following two maximal frequent sequences contain clearly independent parts, which can be found if they also appear elsewhere, e.g., the name "Oskar Lafontaine" without the title "finance minister":

finance minister Theo Waigel
finance minister Oskar Lafontaine

finance minister
Theo Waigel
Oskar Lafontaine
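
The basic definitions (occurrence under a maximal gap, document frequency, and maximality) can be checked on the three sample documents with a short Python sketch. This is only a brute-force illustration of the definitions, not the discovery method of the article. The names occurs, document_frequency, is_subsequence, and maximal, the whitespace tokenization, and the max_gap parameter are assumptions introduced here for illustration.

```python
from typing import List, Optional, Sequence

# The three sample document portions, tokenized on whitespace (an assumption).
RAW_DOCS = [
    "The Congress subcommittee backed away from mandating specific "
    "retaliation against foreign countries for unfair foreign trade practices",
    "He urged Congress to reject provisions that would mandate U.S. "
    "retaliation against foreign unfair trade practices",
    "Washington charged France West Germany the U.K. Spain and the EC "
    "Commission with unfair practices on behalf of Airbus",
]
DOCS: List[List[str]] = [d.split() for d in RAW_DOCS]


def occurs(seq: Sequence[str], doc: Sequence[str],
           max_gap: Optional[int] = None) -> bool:
    """True if the words of seq appear in doc in the same order, with at most
    max_gap other words between consecutive matched words (None = no limit)."""
    if not seq:
        return True
    # Positions where the prefix matched so far may end.
    ends = [i for i, w in enumerate(doc) if w == seq[0]]
    for word in seq[1:]:
        ends = [
            i for i, w in enumerate(doc)
            if w == word and any(
                p < i and (max_gap is None or i - p - 1 <= max_gap)
                for p in ends
            )
        ]
        if not ends:
            return False
    return bool(ends)


def document_frequency(seq: Sequence[str], docs: Sequence[Sequence[str]],
                       max_gap: Optional[int] = None) -> int:
    """Number of documents containing seq; each document counts once."""
    return sum(occurs(seq, doc, max_gap) for doc in docs)


def is_subsequence(s: Sequence[str], t: Sequence[str]) -> bool:
    """True if all words of s occur in t in the same order."""
    it = iter(t)
    return all(word in it for word in s)


def maximal(frequent: Sequence[tuple]) -> List[tuple]:
    """Keep only sequences not contained in another sequence of the input."""
    return [s for s in frequent
            if not any(s != t and is_subsequence(s, t) for t in frequent)]


if __name__ == "__main__":
    seq = "retaliation against foreign unfair trade practices".split()
    print(document_frequency(seq, DOCS, max_gap=2))
    print(document_frequency(["Congress", "practices"], DOCS, max_gap=2))
```

With a maximal gap of 2, the six-word sequence is found in the first two sample documents, whereas (Congress, practices) is found in none of them; without a gap limit, (Congress, practices) occurs in the first two documents, which is the kind of long-distance pair the gap restriction is meant to exclude.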
Journal title:
- Library Trends
Volume 48, Issue
Pages -
Publication date 1999